Abstract A robust estimator for a wide family of mixtures of linear regression is
presented. Robustness is based on the joint adoption of the Cluster Weighted Model and
of an estimator based on trimming and restrictions. The selected model provides the
conditional distribution of the response for each group, as in mixtures of regression,
and further supplies local distributions for the explanatory variables. A novel version
of the restrictions has been devised, under this model, for separately controlling the
two sources of variability identified in it. This proposal avoids singularities in the
log-likelihood, caused by approximate local collinearity in the explanatory variables
or local exact fits in regressions, and reduces the occurrence of spurious local
maximizers. In a natural way, due to the interaction between the model and the estimator,
the procedure is able to resist the harmful influence of bad leverage points during the
estimation of the mixture of regressions, which is still an open issue in the literature.
The given methodology defines a well-posed statistical problem, whose estimator
exists and is consistent to the corresponding solution of the population optimum, under
widely general conditions. A feasible EM algorithm has also been provided to obtain
the corresponding estimation. Many simulated examples and two real datasets have
been chosen to show the ability of the procedure, on the one hand, to detect
anomalous data, and, on the other hand, to identify the real cluster regressions without the
influence of contamination.
1 Introduction


Mixture models provide a quite flexible approach to statistical modeling of a wide
variety of random phenomena, whenever we can reasonably suppose that the
observations arise from unobserved groups in the population. Under this general framework,
the present paper provides a new proposal in the family of finite mixtures of robust
regressions (DeSarbo and Cron, 1988; De Veaux, 1989).

Assume we are provided with two quantitative random variables X and Y: X
is a vector of explanatory variables, Y is a response or outcome variable, and the
dependence between Y and X may vary among the different underlying groups. By
adopting the cluster-weighted approach, we allow different scatter structures in each
group, both in the marginal distribution of X and in the conditional distribution of
Y|X = x, as is required by many observed datasets. The Cluster Weighted Model
(CWM), introduced in Gershenfeld (1997), decomposes the joint p.d.f. of (X, Y) in
each component of the mixture as the product of the marginal and the conditional
distributions.

Due to its very definition, the CWM estimator is able to take into account different
distributions for the explanatory variables across groups, thus overcoming an intrinsic
limitation of mixtures of regression, where they are implicitly assumed equally
distributed. However, due to the possible presence of contaminating data (background
noise, pointwise contamination, unexpected minority patterns, etc.) a small
fraction of outliers could severely affect the model fitting. Among the available
standard techniques in robust estimation, those based on removing part of the data,
called impartial trimming procedures, perform well and are often an
obligatory benchmark against which new estimators are compared. Successful robust procedures of
this kind are, for instance, the LTS for regression models (Rousseeuw and Leroy,
1987), the trimmed k-means (Cuesta-Albertos et al., 1997), the TCLUST for
clustering (García-Escudero et al., 2008), and the robust clusterwise linear regression
models (García-Escudero et al., 2010). Here, in the framework of mixtures of regressions,
denoting by x and y the realizations of X and Y, standard diagnostic tools can
easily identify outliers on y that fall in the range of values of x, while the detection of
outliers on both x and y, that may act as bad leverage points, is much more
problematic. Many trimming approaches are effective for the first type of outliers, but they
fail when dealing with bad leverage points. In this paper, we exploit the fact that the CWM
models the marginal distribution of X to detect dangerous outliers on x. At
the same time, we also use the regression structure between X and Y to deal with
outliers on y. In this way, by robustifying the CWM estimation, we can simultaneously
handle both types of outliers with the same formal approach. As usual when using
trimming, only the total fraction of discarded observations must be fixed in advance.

A further issue with ML estimation for CWMs is the unboundedness of the
log-likelihood function, a well-known aspect pointed out in Day (1969) for
Gaussian mixtures. To overcome this drawback, Hathaway (1985) introduced the use of
constrained variance estimation in univariate mixture modeling. These restrictions
have been extended to the multivariate case in different ways by McLachlan and Peel
(2004), Ingrassia and Rocci (2007) and García-Escudero et al. (2008). By adopting
restrictions also for CWM, we arrive at a well-posed optimization problem.
Additionally, a restricted approach not only avoids singularities, it also discards
non-interesting local maximizers of the objective function (García-Escudero et al., 2014b).
We will discuss in detail how approximate local collinearity in the explanatory
variables, and approximate local exact fits in the regressions may cause, indeed, serious
troubles in CWMs.

The above considerations give rise to the robust estimation of the trimmed
Cluster Weighted Restricted Model (trimmed CWRM) presented hereafter. It includes an
original application of the constraints, which takes into account the specific features
of CWM and controls the relative variability between components for the sources
of variability in the model corresponding to: i) the explanatory variables, and ii) the
regression errors. The CWM, endowed with restrictions and trimming, becomes a
very competitive robust estimator for mixtures of multiple regression, with optimal
statistical properties.

We have organized the paper as follows. In Section 2 we recall the main ideas
about the CWM. In Section 3 we present the trimmed CWRM, and introduce a
feasible algorithm for its practical implementation. Then, we state the central findings of
the paper, i.e. the existence and the strong consistency of the new estimator. Section
4 provides a discussion on the effects of constraints and trimming, along with some
illustrative examples. The application of the proposed methodology to two real data
sets is shown in Section 5. Finally, Section 6 contains some concluding remarks and
sketches future research. Proofs and technical lemmas needed for our main results
are relegated to the Appendix.


2 Cluster Weighted Modeling 


The Cluster Weighted Model (CWM) has been proposed in the context of media
technology, to build a digital violin with traditional inputs and realistic sounds (Gershenfeld,
1997; Gershenfeld et al., 1999); in Wedel (2000), CWMs are referred to as the
family of saturated mixture regression models. In Ingrassia et al. (2012), CWMs have
been reformulated in a statistical setting showing that they are a general and flexible
family of mixture models. In fact, Ingrassia et al. (2012) show that the Gaussian CWM
includes, as special cases, finite mixtures of distributions and finite Mixtures of
Regression models.

Let (X, Y) be a pair of random variables, namely a vector of covariates X and a
response variable Y, defined on $\Omega$ with values in $\mathcal{X} \times \mathcal{Y} \subseteq \mathbb{R}^d \times \mathbb{R}$, and let
$\{(x_i, y_i)\}_{i=1}^{n}$ represent an i.i.d. random sample of size n, drawn from (X, Y). Let p(x, y) denote
the joint density of (X, Y), and suppose that $\Omega$ can be partitioned into G groups, say
$\Omega_1, \ldots, \Omega_G$. CWMs are mixture models having density of the form

$$p(x, y; \theta) = \sum_{g=1}^{G} p(y|x; \xi_g)\, p(x; \psi_g)\, \pi_g, \tag{1}$$

where $p(y|x; \xi_g)$ is the conditional density of Y given x in $\Omega_g$ (depending on some
parameter $\xi_g$), $p(x; \psi_g)$ is the marginal density of X in $\Omega_g$ (depending on some
parameter $\psi_g$) and $\pi_g$ is the weight of $\Omega_g$ in the mixture (with $\pi_g > 0$ and $\sum_{g=1}^{G} \pi_g = 1$).
Furthermore, we assume that in each group $\Omega_g$, the conditional expectation of
Y given X = x is a function $m(\cdot)$ of x depending on some parameters $\beta_g$, that is
$E(Y|x, \Omega_g) = m(x; \beta_g)$.

In this work, we have focused on models of type (1) with Gaussian components.
Thus $p(x; \psi_g) = \phi_d(x; \mu_g, \Sigma_g)$, where $\phi_d(\cdot; \mu_g, \Sigma_g)$ denotes the density of the
d-variate Gaussian distribution with mean vector $\mu_g$ and covariance matrix $\Sigma_g$.
Moreover, we have assumed that the conditional relationship between Y and x in the
g-th group can be written as $Y = b_g'x + b_g^0 + \varepsilon_g$ where $\varepsilon_g \sim N(0, \sigma_g^2)$. Hence,
$X|\Omega_g \sim N_d(\mu_g, \Sigma_g)$ and $Y|x, \Omega_g \sim N(b_g'x + b_g^0, \sigma_g^2)$, so that model (1)
specializes to:

$$p(x, y; \theta) = \sum_{g=1}^{G} \phi(y; b_g'x + b_g^0, \sigma_g^2)\, \phi_d(x; \mu_g, \Sigma_g)\, \pi_g, \tag{2}$$

which defines the linear Gaussian CWM. We notice here that definition (2)
corresponds to a mixture of regressions, with weights $\phi_d(x; \mu_g, \Sigma_g)\pi_g$ depending also
on the covariate distributions in each component g for g = 1, ..., G. Finally, in the
framework of model-based clustering, each unit is assigned to one group, based on
the maximum a posteriori probability. Model (2) leads to the
(log-)likelihood target function to be maximized,

$$\sum_{i=1}^{n} \log \left[ \sum_{g=1}^{G} \phi(y_i; b_g'x_i + b_g^0, \sigma_g^2)\, \phi_d(x_i; \mu_g, \Sigma_g)\, \pi_g \right]. \tag{3}$$

For the sake of simplicity, we will later use the notation

$$D_g(x, y; \theta) = \phi(y; b_g'x + b_g^0, \sigma_g^2)\, \phi_d(x; \mu_g, \Sigma_g)\, \pi_g$$

and $D(x, y; \theta) = \sum_{g=1}^{G} D_g(x, y; \theta)$, where the set of all parameters of the model
is denoted by $\theta$, so that (3) is simply rewritten as $\sum_{i=1}^{n} \log D(x_i, y_i; \theta)$.
Additionally, the linear Gaussian CWM will often be referred to simply as
CWM.
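
As an illustration of this notation, the following minimal sketch evaluates $D_g$, $D$, the target function (3) and the maximum a posteriori assignment for the special case of a scalar covariate (d = 1). The dictionary layout, the function names and the parameter values are assumptions made for the example only; they are not taken from the paper.

```python
import math

def normal_pdf(v, mean, var):
    """Univariate Gaussian density phi(v; mean, var)."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def component_density(x, y, comp):
    """D_g(x, y; theta) for one component, scalar covariate (d = 1)."""
    reg = normal_pdf(y, comp["b"] * x + comp["b0"], comp["sigma2"])
    marg = normal_pdf(x, comp["mu"], comp["Sigma"])
    return reg * marg * comp["pi"]

def mixture_density(x, y, theta):
    """D(x, y; theta): sum of D_g over the G components."""
    return sum(component_density(x, y, comp) for comp in theta)

def log_likelihood(data, theta):
    """Target function (3): sum over i of log D(x_i, y_i; theta)."""
    return sum(math.log(mixture_density(x, y, theta)) for x, y in data)

def map_group(x, y, theta):
    """Index of the component with maximum a posteriori probability."""
    dens = [component_density(x, y, comp) for comp in theta]
    return max(range(len(theta)), key=lambda g: dens[g])

# Hypothetical two-component parameter set, for illustration only.
theta = [
    {"pi": 0.5, "mu": 2.0, "Sigma": 1.0, "b": 1.0, "b0": 0.0, "sigma2": 0.25},
    {"pi": 0.5, "mu": 8.0, "Sigma": 1.0, "b": -1.0, "b0": 12.0, "sigma2": 0.25},
]
data = [(2.0, 2.1), (1.5, 1.4), (8.0, 4.2), (8.5, 3.4)]

print(log_likelihood(data, theta))
print([map_group(x, y, theta) for x, y in data])  # [0, 0, 1, 1]
```

Since the posterior probability of group g is $D_g/D$ and the denominator is common to all groups, comparing the $D_g$ values directly is enough for the assignment.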


2.1 Two problems about CWM 


The estimation of the (linear Gaussian) CWM suffers from a serious lack of
robustness, as happens with many other models based on normal assumptions
and fitted through ML estimators (see, e.g., Huber, 1981). It is very important to be
aware of this issue, due to the common presence of noise sources in data. To
illustrate this problem, a simulated data set of n = 180 units (referred to as Simdata1
hereafter) has been generated from the CWM with G = 2 and 90 observations from
each component. Then we added 20 contaminating observations as either background
noise, see Figure 1(a), or pointwise contamination around the point (15, 20), see
Figure 1(b). The true underlying regression lines (prior to contamination) are represented
with dotted lines, and we can see the dangerous effects of outliers on model fitting
for the standard CWM.
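
A contaminated sample of this kind can be produced along the following lines. This is only a sketch: the component parameters, the ranges of the background noise, the spread of the pointwise contamination and the seed are hypothetical values chosen for illustration, since the text does not report the ones used for Simdata1. Only the sample sizes (90 + 90 + 20) and the contamination centre (15, 20) come from the description above.

```python
import random

def simulate_cwm(n_per_group, comps, rng):
    """Draw (x, y) pairs from a linear Gaussian CWM with scalar covariate."""
    sample = []
    for comp in comps:
        for _ in range(n_per_group):
            x = rng.gauss(comp["mu"], comp["Sigma"] ** 0.5)
            y = rng.gauss(comp["b"] * x + comp["b0"], comp["sigma2"] ** 0.5)
            sample.append((x, y))
    return sample

def background_noise(n_noise, x_range, y_range, rng):
    """Uniform noise over a rectangle."""
    return [(rng.uniform(*x_range), rng.uniform(*y_range)) for _ in range(n_noise)]

def pointwise_contamination(n_noise, centre, spread, rng):
    """Tight cloud of points around a single location."""
    cx, cy = centre
    return [(rng.gauss(cx, spread), rng.gauss(cy, spread)) for _ in range(n_noise)]

# Hypothetical component parameters (not the ones used in the paper).
comps = [
    {"mu": 3.0, "Sigma": 1.0, "b": 1.0, "b0": 2.0, "sigma2": 0.25},
    {"mu": 8.0, "Sigma": 1.0, "b": -1.0, "b0": 14.0, "sigma2": 0.25},
]
rng = random.Random(2024)

clean = simulate_cwm(90, comps, rng)                        # 180 regular units
with_background = clean + background_noise(20, (0.0, 20.0), (0.0, 25.0), rng)
with_pointwise = clean + pointwise_contamination(20, (15.0, 20.0), 0.3, rng)

print(len(clean), len(with_background), len(with_pointwise))  # 180 200 200
```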
Fig. 1 Simdata1: (a) original data plus background noise and CWM fitted; (b) original data plus pointwise
contamination and CWM fitted; (c) and (d) show the trimmed CWRMs with $\alpha = 0.1$, $c_X = c_\varepsilon = 20$
fitted to these two data sets. The dotted lines represent the true regression lines to be estimated and black
circles are the trimmed observations (here and in all the figures).


Another important issue concerns the unboundedness of the target function in (3)
when no constraints are imposed on the scatter parameters. In this case, the defining
problem is ill-posed because the log-likelihood in (3) tends to $\infty$ when either $\mu_g = x_i$
and $|\Sigma_g| \to 0$, or $y_i = b_g'x_i + b_g^0$ and $\sigma_g^2 \to 0$. Moreover, as a consequence, the
EM algorithms often applied to fit a CWM can be trapped in non-interesting local
maximizers, called "spurious" solutions, and the result of the EM algorithm strongly
depends on its initialization.
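
The second kind of degeneracy can be checked numerically. In the sketch below (scalar covariate, hypothetical data and parameter values), the regression line of one component passes exactly through a single observation, and the target function (3) keeps growing as the error variance of that component shrinks.

```python
import math

def normal_pdf(v, mean, var):
    """Univariate Gaussian density phi(v; mean, var)."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_lik(data, theta):
    """Target function (3) for a scalar covariate."""
    total = 0.0
    for x, y in data:
        dens = sum(
            c["pi"]
            * normal_pdf(y, c["b"] * x + c["b0"], c["sigma2"])
            * normal_pdf(x, c["mu"], c["Sigma"])
            for c in theta
        )
        total += math.log(dens)
    return total

# Hypothetical toy data lying roughly along y = x.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8), (4.0, 4.1)]
broad = {"pi": 0.5, "mu": 2.0, "Sigma": 2.0, "b": 1.0, "b0": 0.0, "sigma2": 0.1}

def degenerate_theta(sigma2):
    # The flat line y = 2.2 fits the observation (2.0, 2.2) exactly.
    spike = {"pi": 0.5, "mu": 2.0, "Sigma": 2.0, "b": 0.0, "b0": 2.2, "sigma2": sigma2}
    return [broad, spike]

values = [log_lik(data, degenerate_theta(s2)) for s2 in (1e-2, 1e-6, 1e-10, 1e-14)]
for s2, v in zip((1e-2, 1e-6, 1e-10, 1e-14), values):
    print(s2, v)  # the log-likelihood increases as sigma2 decreases
```

The increase is of order $-\frac{1}{2}\log \sigma_g^2$, so no finite maximum exists unless the ratio between the error variances is bounded.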

Spurious solutions may be due to very localized patterns in the explanatory
variables, as shown in Figure 2(a), by considering a second simulated data set
(Simdata2). Here, the data consist of n = 200 observations and d = 2 explanatory variables.
The dataset has been built as follows: two sets of 90 observations for the explanatory
variable X have been drawn from two bivariate normal distributions, centered at (2, 2)
and (4, 4), respectively. Then, 20 almost collinear observations have been added to
the sample, close to the second component. The values for the response variable Y
have been generated by using the same linear function (for both components) with
equally distributed error terms. We can see in Figure 2(a) that the standard fit of the
CWM leads to a first spurious component made up of the 20 almost
collinear observations and a second component joining together the two groups, with
90% of the observations.

Sometimes spurious solutions may also be due to localized patterns of
observations, where an approximate "exact fit" for a small number of observations can be
obtained. Figure 3 shows a third simulated data set (Simdata3) with n = 200
observations, where 196 of them have been generated from a CWM with G = 2 components
(98 observations from each component). A very small fraction of almost collinear
units (only 4 observations) on the (X, Y) variables have been added, with a roughly
equal value (around 0) for the response variable. These values, for instance, could be
due to a malfunction of the tool used to measure the response variable. It may be
seen that a fitted component including only these almost collinear observations could
arise during the EM estimation, because a small value of one of the $\sigma_g^2$ parameters
leads to higher values of the log-likelihood. Then, the two main linear structures
accounting for 98% of the data points would be artificially joined together.

To overcome the previous issues, in the next section we propose a robust
methodology that incorporates trimming and constraints into the CWM.


3 Trimmed Cluster Weighted Restricted Modeling 

3.1 Problem statement 


For a given sample of n observations, the trimmed CWRM methodology is based on
the maximization of the following log-likelihood function

$$\sum_{i=1}^{n} z(x_i, y_i) \log \left[ \sum_{g=1}^{G} \phi(y_i; b_g'x_i + b_g^0, \sigma_g^2)\, \phi_d(x_i; \mu_g, \Sigma_g)\, \pi_g \right], \tag{4}$$

where $z(\cdot, \cdot)$ is a 0-1 trimming indicator function that tells us whether observation
$(x_i, y_i)$ is trimmed off ($z(x_i, y_i) = 0$) or not ($z(x_i, y_i) = 1$). A fixed fraction $\alpha$ of
observations can be left unassigned by setting $\sum_{i=1}^{n} z(x_i, y_i) = [n(1 - \alpha)]$. Hence the
parameter $\alpha$ denotes the trimming level. Analogous approaches based on trimmed
mixture likelihoods can be found in Neykov et al. (2007), Gallegos and Ritter (2009)
and García-Escudero et al. (2014b).

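Moreover, we introduce two further constraints on the maximization in (4). The
first one concerns the set of eigenvalues $\{\lambda_l(\Sigma_g)\}_{l=1,\ldots,d}$ of the scatter matrices $\Sigma_g$
by imposing

$$\lambda_{l_1}(\Sigma_{g_1}) \le c_X\, \lambda_{l_2}(\Sigma_{g_2}) \quad \text{for every } 1 \le l_1 \ne l_2 \le d \text{ and } 1 \le g_1 \ne g_2 \le G. \tag{5}$$

The second constraint refers to the variances $\sigma_g^2$ of the regression error terms, by
requiring

$$\sigma_{g_1}^2 \le c_\varepsilon\, \sigma_{g_2}^2 \quad \text{for every } 1 \le g_1 \ne g_2 \le G. \tag{6}$$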
The constants $c_X$ and $c_\varepsilon$, in (5) and (6) respectively, are finite (not necessarily equal)
real numbers, such that $c_X \ge 1$ and $c_\varepsilon \ge 1$. They automatically guarantee that we
are avoiding the $|\Sigma_g| \to 0$ and $\sigma_g^2 \to 0$ cases. These constraints are an extension
to CWMs of those introduced in Ingrassia and Rocci (2007), García-Escudero et al.
(2008) and Greselin and Ingrassia (2010), and go back to Hathaway (1985). The main
difference is the asymmetric and different treatment given by the constraints, when
modeling the marginal distribution of X or when modeling the regression error terms,
providing high flexibility to the model.
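
A simple way to check whether a candidate set of scatter parameters is admissible is sketched below. It assumes that (5) is read as a bound on the ratio between the largest and the smallest eigenvalue over all groups (the reading under which $c_X = 1$ forces equal spherical scatter matrices), and that (6) bounds the ratio between the largest and the smallest error variance. The function names and the numeric values are illustrative assumptions.

```python
def max_ratio(values):
    """Ratio between the largest and the smallest of a set of positive values."""
    return max(values) / min(values)

def satisfies_constraints(eigenvalues, sigma2, c_x, c_eps):
    """Check constraints (5) and (6).

    eigenvalues: one list of eigenvalues of Sigma_g per group.
    sigma2:      one regression error variance per group.
    """
    all_eigs = [lam for group in eigenvalues for lam in group]
    return max_ratio(all_eigs) <= c_x and max_ratio(sigma2) <= c_eps

# Hypothetical values: eigenvalue ratio 10, error variance ratio about 25.
eigs = [[1.0, 4.0], [2.0, 10.0]]
variances = [0.25, 0.01]

print(satisfies_constraints(eigs, variances, c_x=20.0, c_eps=30.0))  # True
print(satisfies_constraints(eigs, variances, c_x=20.0, c_eps=20.0))  # False
print(satisfies_constraints(eigs, variances, c_x=5.0, c_eps=30.0))   # False
```

Keeping the two checks separate mirrors the asymmetric treatment of the two sources of variability: a large $c_\varepsilon$ can be combined with a small $c_X$, or the other way round.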

Let us consider now the effects of trimming in the two data sets derived from
Simdata1. In Figure 1(c) and (d) we can see that setting $\alpha = 0.1$ allows us to recover the
true structure of the data, by discarding the outlying observations, both in the case of
background noise and pointwise contamination. Hence, trimming modifies the ML
estimation in such a way that it is no longer influenced by potential outliers and drives
it away from the previous bad results.

Regarding the use of constraints, a moderate choice of $c_X$ for
Simdata2 in Figure 2(b) allows us to correctly detect the G = 2 main groups and to
avoid the disturbing effect of the spurious patterns in the explanatory variables.

Additionally, a moderate choice of $c_\varepsilon$ for Simdata3 would also
allow us to correctly detect the G = 2 main groups. Moreover, Figure 3(a) shows
that using only an $\alpha = 0.02$ trimming level (trying to discard the 4 outlying
observations in Simdata3) does not solve the problem at all unless
a moderate value of $c_\varepsilon$ is also imposed.

A detailed discussion about the role played by $\alpha$, $c_X$ and $c_\varepsilon$ is given in Section 4.


3.2 Theoretical results 

The problem stated in Section 3.1 admits a population counterpart. Let $P = P_{(X,Y)}$
be the probability measure in $\mathbb{R}^{d+1}$ induced by the joint distribution of the random
variables X and Y and let $E_P(\cdot)$ denote the expectation with respect to P. Let $\Theta_{c_X, c_\varepsilon}$
denote hereafter the set of all possible $\theta$ which satisfy constraints (5) and (6)
for given constants $c_X$ and $c_\varepsilon$. With this notation, the population problem is defined
through the double maximization of $E_P[\log D(X, Y; \theta)\, I_A(X, Y)]$ over all possible
$\theta \in \Theta_{c_X, c_\varepsilon}$ and over all possible subsets $A \subseteq \mathbb{R}^{d+1}$ with $P[A] \ge 1 - \alpha$. As usual,
$I_A(\cdot)$ denotes the indicator function of set A. We will see that the optimal set A can
be determined directly from $\theta$. In more detail, for fixed $\theta$, and denoting by

$$R(\theta, P) = \sup_u \{u : P[(X, Y) : D(X, Y; \theta) \ge u] \ge 1 - \alpha\},$$

then A is given by $A(\theta) = A(\theta, P) = \{(x, y) : D(x, y; \theta) \ge R(\theta, P)\}$. Therefore,
we reduce the population problem to that of maximizing

$$L(\theta, P) = E_P[\log D(X, Y; \theta)\, I_{A(\theta)}(X, Y)], \quad \text{on } \theta \in \Theta_{c_X, c_\varepsilon}. \tag{7}$$

Note that we recover the original sample problem introduced in Section 3.1
just by taking P equal to the empirical measure and setting
$z(x_i, y_i) = I_A(x_i, y_i)$ for the optimal set A. The way that the optimal set A is
obtained from $\theta$ will also be used in the C-steps of the algorithm to be presented in
Section 3.3.

Fig. 2 Simdata2: Scatter plot matrix. (a) Almost collinear observations in the explanatory variables which
are found as a cluster by CWM when G = 2; (b) Results of fitting the trimmed CWRM with $\alpha = 0$,
$c_X = c_\varepsilon = 20$.
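
For the empirical measure, determining the optimal set from a given $\theta$ amounts to ranking the observations by $D(x_i, y_i; \theta)$ and keeping the $[n(1 - \alpha)]$ largest ones. A minimal sketch for a scalar covariate follows; the function names, the data and the parameter values are assumptions made for the example, and ties or zero densities among retained observations are not handled.

```python
import math

def normal_pdf(v, mean, var):
    """Univariate Gaussian density phi(v; mean, var)."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def component_densities(x, y, theta):
    """List of D_g(x, y; theta), g = 1, ..., G, for a scalar covariate."""
    return [
        c["pi"]
        * normal_pdf(y, c["b"] * x + c["b0"], c["sigma2"])
        * normal_pdf(x, c["mu"], c["Sigma"])
        for c in theta
    ]

def trim_given_theta(data, theta, alpha):
    """Return trimming indicators z, posterior weights and trimmed log-likelihood (4)."""
    n = len(data)
    # [n(1 - alpha)] retained observations; the small offset guards against rounding.
    n_keep = math.floor(n * (1.0 - alpha) + 1e-9)
    dens = [component_densities(x, y, theta) for x, y in data]
    totals = [sum(row) for row in dens]
    order = sorted(range(n), key=lambda i: totals[i], reverse=True)
    kept = set(order[:n_keep])
    z = [1 if i in kept else 0 for i in range(n)]
    tau = [
        [d / totals[i] for d in dens[i]] if i in kept else [0.0] * len(theta)
        for i in range(n)
    ]
    trimmed_loglik = sum(math.log(totals[i]) for i in kept)
    return z, tau, trimmed_loglik

# Hypothetical parameters and data; the last point acts as a bad leverage point.
theta = [
    {"pi": 0.5, "mu": 2.0, "Sigma": 1.0, "b": 1.0, "b0": 0.0, "sigma2": 0.25},
    {"pi": 0.5, "mu": 8.0, "Sigma": 1.0, "b": -1.0, "b0": 12.0, "sigma2": 0.25},
]
data = [(2.0, 2.1), (1.5, 1.4), (8.0, 4.2), (8.5, 3.4), (15.0, 20.0)]

z, tau, value = trim_given_theta(data, theta, alpha=0.2)
print(z)  # [1, 1, 1, 1, 0]
```

Because $D$ includes the marginal density of X, an observation that is far from every $\mu_g$ receives a small value of $D$ even when it lies close to one of the regression lines, which is how leverage points end up among the trimmed units.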

In this section, we present results guaranteeing the existence of the solutions for
both the sample and the population problem. Moreover, we state the consistency
of the sample solution to the population one. These results are derived under very
mild assumptions on the underlying distribution P. In fact, no moment conditions
are needed on P and, thus, the proposed methodology can be applied even to
heavy-tailed distributions. We will only exclude for P some "pathological" cases that are
clearly not appropriate, namely:
Fig. 3 Simdata3: (a) Results of the trimmed CWRM fit with G = 2, $\alpha = 0.02$ and a very large $c_\varepsilon$ (almost
unrestricted) showing the detection of a spurious component due to an approximate "local exact fit" in one
of the fitted regressions; (b) results with $\alpha = 0.02$, $c_X = c_\varepsilon = 20$.


(PR) The support of P is not concentrated on G regression hyperplanes and
the support of X is not concentrated in G points in $\mathbb{R}^d$ after removing a
probability mass equal to $\alpha$,

where we say that $S \subseteq \mathbb{R}^{d+1}$ is concentrated in a "regression hyperplane" if an
"exact fit" property holds for some $b^0$ and b in such a way that $y = b'x + b^0$ for
all $(x, y) \in S$. The previous condition holds for absolutely continuous distributions P
as well as for empirical measures $P_n$ obtained from absolutely continuous distributions
when n is large enough.

Proposition 3.2.1 If (PR) holds for P, then there exists $\theta \in \Theta_{c_X, c_\varepsilon}$ maximizing
$L(\theta, P)$.

The underlying distribution P is typically unknown and we often only rely on the
result of a random sample from P. Let $\theta_n$ denote the solution of the sample problem
for a random sample of size n. If the population problem has a unique solution $\theta_0$,
then the following property states that $\theta_n$ should be close to $\theta_0$ when n is large.

Proposition 3.2.2 Assume that P is an absolutely continuous distribution with strictly
positive density function satisfying (PR) and that $\theta_0$ is the unique maximizer of
$L(\theta, P)$ for $\theta \in \Theta_{c_X, c_\varepsilon}$. If $\{\theta_n\}_{n=1}^{\infty} \subset \Theta_{c_X, c_\varepsilon}$ is a sequence of maximizers of (7)
when P is replaced by the sequence of empirical measures $\{P_n\}_{n=1}^{\infty}$, referred to a
sequence of i.i.d. samples from P, then $\theta_n \to \theta_0$ almost surely.


Note that, apart from the (PR) condition, a uniqueness condition is also needed to
get consistency. It is also important to note that the parameters obtained by solving
the maximization (7) do not necessarily coincide with the parameters of the mixture
components appearing in the definition of the (uncontaminated) CWM. However, we
conjecture that these two different types of parameters are "close" to each other
whenever the contamination does not overlap much with the most interior regions of the
mixture components and when $\alpha$, $c_X$ and $c_\varepsilon$ are "properly" chosen. Establishing
results that formalize this idea is not an easy task (as happens even in simpler
clustering approaches).

Although the proofs of these theoretical results, given in the Appendix, are
related to previous works in García-Escudero et al. (2008) and García-Escudero et al.
(2014a), several specific technicalities must be sorted out for the present case. In fact,
these technicalities are far from being straightforward and mainly have to do with
how to deal with the effect of "local collinearities" in the regression coefficients.


3.3 Algorithm 


The constrained maximization of the trimmed log-likelihood in (4) over its
parameters is not an easy task. In this section, we present a feasible algorithm obtained by
combining the EM algorithm for CWM with that (with trimming and constraints)
introduced in García-Escudero et al. (2014b) (see, also, Fritz et al., 2013):


1. Initialization: The algorithm is initialized several times by selecting different
initial $\theta^{(0)} = (\pi_1^{(0)}, \ldots, \pi_G^{(0)}, \mu_1^{(0)}, \ldots, \mu_G^{(0)}, \Sigma_1^{(0)}, \ldots, \Sigma_G^{(0)}, b_1^{0(0)}, \ldots, b_G^{0(0)}, b_1^{(0)}, \ldots, b_G^{(0)}, \sigma_1^{2(0)}, \ldots, \sigma_G^{2(0)})$.
After drawing d + 2 distinct observations for each group,
we compute their sample means and sample covariance matrices as initial values
for $\mu_g^{(0)}$ and $\Sigma_g^{(0)}$. Additionally, G ordinary least squares regressions are carried
out to obtain initial $b_g^{0(0)}$ and $b_g^{(0)}$ regression parameters (G-inverse matrices are
used if needed). The mean square errors of the G regressions are used to
determine the initial $\sigma_g^{2(0)}$ values. If $\Sigma_g^{(0)}$ and/or $\sigma_g^{2(0)}$ do not satisfy the required
constraints (5) and (6) then the procedure that will be described in Step 2.2 is
applied to enforce them. Finally, weights $\pi_1^{(0)}, \ldots, \pi_G^{(0)}$ in the interval (0, 1) and
summing up to 1 are randomly chosen.

2. Trimmed EM steps: Starting from each random initialization, the following
steps are executed alternately until convergence or until a maximum number of
iterations is reached. The implementation of trimming is closely related to how
"concentration" steps (C-steps) are carried out to implement high-breakdown
robust methods (see, e.g., Rousseeuw and Van Driessen, 1999).
2.1. E- and C-steps: Let $\theta^{(l)}$ be the parameters at iteration l. We compute
$D_i = D(x_i, y_i; \theta^{(l)})$ for i = 1, ..., n. After sorting these values, the notation
$D_{(1)} \le \ldots \le D_{(n)}$ is adopted. Let us consider the subset of indices $I \subseteq \{1, 2, \ldots, n\}$
defined as $I = \{i : D_i \ge D_{([n\alpha])}\}$. To update the parameters, we will
take into account only the observations with indices in I, by setting
$\tau_{ig}^{(l)} = D_g(x_i, y_i; \theta^{(l)}) / D(x_i, y_i; \theta^{(l)})$ for $i \in I$ and $\tau_{ig}^{(l)} = 0$ for $i \notin I$. Note that
$\tau_{ig}^{(l)}$, for the observations with indices in I, are the usual "posterior
probabilities" in the standard EM algorithm.

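2.2. M-step: From these $\tau_{ig}^{(l)}$ values, we update the weight and mean parameters as

$$\pi_g^{(l+1)} = \sum_{i=1}^{n} \tau_{ig}^{(l)} \Big/ [n(1 - \alpha)] \quad \text{and} \quad \mu_g^{(l+1)} = \sum_{i=1}^{n} \tau_{ig}^{(l)} x_i \Big/ \sum_{i=1}^{n} \tau_{ig}^{(l)}.$$

The other parameters (regression and scatter ones) are initially updated by
their $\tau_{ig}^{(l)}$-weighted counterparts: the weighted sample covariance matrices $T_g$ of the
covariates, the weighted least squares regression coefficients, and the weighted
residual variances $s_g^2$.

Along the iterations, due to the updates, it may happen that the $T_g$ matrices
and the $s_g^2$ values do not satisfy the required constraints for the scatter
parameters.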

To perform a constrained maximization over the covariance matrices, the
singular-value decomposition $T_g = U_g' E_g U_g$ is considered, with $U_g$ being
an orthogonal matrix and $E_g = \mathrm{diag}(e_{g1}, e_{g2}, \ldots, e_{gd})$ a diagonal matrix.
After defining the truncated eigenvalues as $[e_{gl}]_m = \min(c_X \cdot m, \max(e_{gl}, m))$,
with m being some threshold value, the scatter matrices are finally
updated as $\Sigma_g^{(l+1)} = U_g' E_g^* U_g$, with $E_g^* = \mathrm{diag}([e_{g1}]_{m_{opt}}, [e_{g2}]_{m_{opt}}, \ldots, [e_{gd}]_{m_{opt}})$
and $m_{opt}$ minimizing the real valued function

$$m \mapsto \sum_{g=1}^{G} \pi_g^{(l+1)} \sum_{l=1}^{d} \left( \log([e_{gl}]_m) + \frac{e_{gl}}{[e_{gl}]_m} \right). \tag{8}$$


Analogously, in case the $s_g^2$ parameters do not satisfy the constraint (6),
we consider the truncated variances $[s_g^2]_m = \min(c_\varepsilon \cdot m, \max(s_g^2, m))$. The
variances of the error terms are finally updated as $\sigma_g^{2(l+1)} = [s_g^2]_{m_{opt}}$, with
$m_{opt}$ minimizing the real valued function

$$m \mapsto \sum_{g=1}^{G} \pi_g^{(l+1)} \left( \log([s_g^2]_m) + \frac{s_g^2}{[s_g^2]_m} \right). \tag{9}$$

Proposition 3.2 in Fritz et al. (2013) shows that the two $m_{opt}$ values can be obtained,
respectively, by evaluating 2dG + 1 times the real valued function in (8) and
2G + 1 times the real valued function in (9).


3. Choosing the best obtained solution: When the stopping criterion has been met,
the value of the target function (4) is computed. The parameters yielding the
highest value of the target function are returned as the final output of the algorithm.
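
The truncation used in Step 2.2 can be sketched as follows. The code works directly on eigenvalues (or on error variances, taking one value per group), so no matrix decomposition is involved. Note that the search for the threshold is a swapped-in simplification: instead of the exact scheme of Fritz et al. (2013), which needs only 2dG + 1 evaluations, it compares the objective over the breakpoints of the truncation plus a log-spaced grid, so the returned threshold is only approximately optimal. Function names, the grid size and the numeric values are assumptions made for the example.

```python
import math

def truncate(values, m, c):
    """Truncated values [e]_m = min(c * m, max(e, m))."""
    return [min(c * m, max(e, m)) for e in values]

def objective(groups, weights, m, c):
    """Function (8): sum_g pi_g sum_l ( log([e_gl]_m) + e_gl / [e_gl]_m )."""
    total = 0.0
    for eigs, w in zip(groups, weights):
        for e, t in zip(eigs, truncate(eigs, m, c)):
            total += w * (math.log(t) + e / t)
    return total

def constrain(groups, weights, c, grid_size=200):
    """Enforce a max/min ratio of at most c; returns (truncated groups, threshold)."""
    positive = [e for eigs in groups for e in eigs if e > 0.0]
    lo, hi = min(positive) / c, max(positive)
    # Candidate thresholds: breakpoints of the truncation and a log-spaced grid.
    candidates = positive + [e / c for e in positive]
    candidates += [lo * (hi / lo) ** (k / grid_size) for k in range(grid_size + 1)]
    m_opt = min(candidates, key=lambda m: objective(groups, weights, m, c))
    return [truncate(eigs, m_opt, c) for eigs in groups], m_opt

# Hypothetical eigenvalues of T_1 and T_2 (d = 2); the raw ratio is 50000.
raw = [[0.001, 1.0], [2.0, 50.0]]
fixed, m_opt = constrain(raw, weights=[0.5, 0.5], c=10.0)
flat = [e for eigs in fixed for e in eigs]
print(fixed, max(flat) / min(flat))

# The same routine applied to error variances, as in (9), with one value per group.
variances, _ = constrain([[0.25], [1e-8]], weights=[0.5, 0.5], c=25.0)
print(variances)
```

Every truncated value lies in the interval $[m, c \cdot m]$, so the ratio constraint holds for any threshold; the choice of $m_{opt}$ only determines which admissible solution gives the best value of the objective. Values that already satisfy the constraint are left unchanged.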


4 Constraints and trimming 

4.1 Effect of constraints 

The parameter $c_X$ controls the differences among scatters for the normal distributions
used as mixture components when modeling the vector of covariates X. It also
controls the deviations from sphericity in the multivariate case (d > 1). As $c_X < \infty$, we
prevent $|\Sigma_g|$ from becoming arbitrarily small, ensuring a bounded contribution of
$\phi_d(x_i; \mu_g, \Sigma_g)$ to the log-likelihood function in (4). Moreover, a moderate value of
$c_X$ avoids the detection of spurious solutions, like in the case exemplified in Figure
2. If we set $c_X = 1$, then we force the covariance matrices to satisfy the relation
$\Sigma_1 = \ldots = \Sigma_G = a I_d$ with a > 0 and $I_d$ being the d × d identity matrix. On the
other hand, the larger the value of $c_X$, the larger the differences among covariance
matrices modeling the mixture components of X could be.

For instance, consider the simulated data Simdata4 in Figure 4, which is modeled
according to either $c_X = 1$ or $c_X = 20$, see Figure 4(a) and (b) respectively. Note
that the component variances ($\Sigma_1$ and $\Sigma_2$ are positive real values because d = 1)
are forced to be equal, i.e. $\Sigma_1 = \Sigma_2$, in (a), while $\max\{\Sigma_1/\Sigma_2, \Sigma_2/\Sigma_1\} \le 20$
holds in (b). The densities of the normal distributions considered in the fitted mixture
to model the X distribution are also represented below, to illustrate their variances.

Our recommendation is to take $c_X > 1$ without selecting huge values for it. A
sensible choice, for instance, is $c_X = 20$, as it worked fairly well in most of the cases
we observed in practice, if the explanatory variables are in similar scales.

On the other hand, the constant $c_\varepsilon$ represents the maximum ratio among the
variances of the regression error terms. Even if the ML estimation would be attracted
by solutions in which some $\sigma_g^2 \to 0$, due to their high contribution by means of
$\phi(y_i; b_g'x_i + b_g^0, \sigma_g^2)$ to the maximization of the log-likelihood in (4), a choice of
$c_\varepsilon < \infty$ prevents the algorithm from falling into singularities. Enforcing a value $c_\varepsilon = 1$
imposes the strongest constraint $\sigma_1^2 = \ldots = \sigma_G^2$. For instance, let us consider Simdata5
in Figure 5, which has been generated from a CWM with $\sigma_1^2 = 0.5^2$ and $\sigma_2^2 = 0.1^2$
($\sigma_1^2/\sigma_2^2 = 25$). The results of fitting the trimmed CWRM for this data set are also
shown with bands. Indeed, in specific applications, it is useful to take into account
such bands, centered at the fitted regression lines and with amplitudes given by $\pm 2\sigma_g$,
i.e. twice the estimated standard deviations of the regression error terms. A first
solution corresponding to $c_\varepsilon = 1 < 25$ is given in Figure 5(a), while a second one
corresponding to $c_\varepsilon = 50 > 25$ is given in panel 5(b). Notice the different amplitude
of these bands. However, although different scatters can be effective in many cases, a
huge difference between them is not recommended, as it can lead to fitting a few almost
collinear observations.

Fig. 4 Simdata4: (a) Results for $c_X = 1$, which forces equal scatters in the marginal distribution (the plotted
densities, in the lower part of the figure, represent the normal fitted components); (b) Results for $c_X = 20$,
which allows different scatters. In both cases, $\alpha = 0.1$ and $c_\varepsilon = 20$ have been chosen.


An important feature of the proposed methodology is to provide a different
constraint for the eigenvalues of the matrices $\Sigma_g$ and for the variances of the error terms
$\sigma_g^2$. This allows us to deal with different scales in the explanatory and response
variables, which is common in many applications. On the other hand, the procedure is
not fully affine equivariant in the explanatory variables, due to the considered
constraints. However, if needed, it is close to affine equivariance for large values of $c_X$.

It is well known, see e.g. Ingrassia et al. (2012), that the linear Gaussian CWM
may be seen as included in the finite mixture of Gaussian distributions when
embedding it into a (d + 1)-dimensional space. Also in the latter case, constraints are needed to
avoid singularities and to reduce the detection of spurious solutions. However,
constraints giving a completely symmetric handling of the variability for the
explanatory variables and for the error terms are not always the best idea. For instance, as
a way to provide robustness, we could have considered the TCLUST methodology
(García-Escudero et al., 2008) in the (d + 1)-dimensional space, which needs the
specification of a constant $c \ge 1$ to constrain the maximal ratio among the G × (d + 1)
eigenvalues. Unfortunately, Mixture of Regressions problems often require very high
values for the constant c which do not always guarantee TCLUST to be correctly
protected against spurious solutions.
Fig. 5 Simdata5: (a) Results for $c_\varepsilon = 1$, forcing equal variances in the error terms; (b) Results for a larger
$c_\varepsilon = 20$ value. In both cases, $\alpha = 0.1$ and $c_X = 20$ have been chosen and bands of amplitude $\pm 2\sigma_g$ are
shown.


To illustrate the previous claims, let us consider Simdata6, of size n = 200,
where 180 observations have been generated from a CWM with two groups, and 20
observations have been included as concentrated noise. The data set is plotted in Figure
6, where panel (a) shows the results of applying the TCLUST methodology with
c = 1.5 in dimension d + 1 = 2. We can see that the results are not satisfactory (the
analogues of the regression lines are the axes corresponding to the largest eigenvalue
of the $\Sigma_g$ matrices) and, therefore, higher c values seem to be needed. But higher c
values often yield the detection of undesired spurious solutions. For instance, panel
(b) shows the results of applying TCLUST with c = 500, with the detection of a
cluster containing only the noisy observations. On the other hand, we can see that a
proper fit is obtained in panel (c), when applying the trimmed CWRM with $c_X =
c_\varepsilon = 1.5$.


It is worth noting that asymmetric constraints also underlie some parameterizations
already proposed in closely related problems as, for instance, in Dasgupta and Raftery
(1998), where the eigenvalues of the scatter matrices corresponding to the $(d+1)$-dimensional
fitted mixture components are required to be $\lambda_g \times \{1, a, \ldots, a\}$ with
$a \ll 1$.
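The contrast between the symmetric and the asymmetric constraints can be checked numerically. The following Python sketch is only illustrative: the helper names (`joint_cov`, `joint_eigen_ratio`, `separate_ratios`) and the toy parameter values are our own assumptions, not part of TCLUST or of any existing package. It builds the $(d+1)$-dimensional covariance matrix implied by a linear Gaussian component and compares the eigenvalue ratio that a TCLUST-type constant $c$ would have to accommodate with the two ratios controlled separately by $c_X$ and $c_\varepsilon$.

```python
import numpy as np

def joint_cov(sigma_x, b, sigma2):
    """Covariance of (X, Y) when Y = b'X + b0 + e, Var(X) = sigma_x, Var(e) = sigma2."""
    sigma_x = np.atleast_2d(np.asarray(sigma_x, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    top = np.hstack([sigma_x, (sigma_x @ b)[:, None]])              # [Var(X), Cov(X, Y)]
    bottom = np.hstack([b @ sigma_x, [b @ sigma_x @ b + sigma2]])   # [Cov(Y, X), Var(Y)]
    return np.vstack([top, bottom])

def joint_eigen_ratio(cov_list):
    """Max/min ratio over all eigenvalues of the (d+1)-dimensional scatter matrices."""
    eig = np.concatenate([np.linalg.eigvalsh(S) for S in cov_list])
    return eig.max() / eig.min()

def separate_ratios(sigma_x_list, sigma2_list):
    """Ratios handled separately: eigenvalues of the X scatter matrices, and error variances."""
    eig_x = np.concatenate([np.linalg.eigvalsh(np.atleast_2d(S)) for S in sigma_x_list])
    s2 = np.asarray(sigma2_list, dtype=float)
    return eig_x.max() / eig_x.min(), s2.max() / s2.min()

# Toy example (hypothetical values): two components in d = 1 with equal spreads.
sigma_x = [np.array([[1.0]]), np.array([[1.0]])]
slopes = [np.array([1.0]), np.array([-1.0])]
sigma2 = [0.01, 0.01]

joint = [joint_cov(S, b, s2) for S, b, s2 in zip(sigma_x, slopes, sigma2)]
ratio_joint = joint_eigen_ratio(joint)
ratio_x, ratio_eps = separate_ratios(sigma_x, sigma2)
print(ratio_joint, ratio_x, ratio_eps)
```

In this toy configuration both separate ratios equal 1, so $c_X = c_\varepsilon = 1$ would already be admissible, whereas the joint eigenvalue ratio is roughly 400. This is consistent with the remark that very large values of $c$ may be needed in the $(d+1)$-dimensional formulation.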


4.2 Effect of trimming 

Fig. 6 Simdata6: (a) TCLUST results with $c = 1.5$ and $\alpha = 0.1$; (b) TCLUST results with $c = 500$ and
$\alpha = 0.1$; (c) Trimmed CWRM fitting results with $c_X = c_\varepsilon = 1.5$ and $\alpha = 0.1$.

We start from the well-known Mixture of Regressions model and first consider an
easier trimming approach based on the maximization of

$$\sum_{i=1}^{n} z(x_i, y_i) \log\left[\sum_{g=1}^{G} \phi\left(y_i; b_g' x_i + b_g^0, \sigma_g^2\right)\pi_g\right], \tag{10}$$

with $\sum_{i=1}^{n} z(x_i, y_i) = [n(1-\alpha)]$ and imposing a constraint on the variances of the
error terms, $\sigma_{g_1}^2 \le c_\varepsilon \sigma_{g_2}^2$ for $1 \le g_1, g_2 \le G$. Notice that, in this case, the distribution
of $X$ is not taken into account, hence no trimming related to the $X$ model is
considered. This straightforward robust extension will be referred to as trimmed Mixture
of Regressions (Neykov et al. 2007; García-Escudero et al. 2010). Apart from
the constraints, this approach reduces to the traditional Mixture of Regressions when
$\alpha = 0$, and leads back to the widely applied Least Trimmed Squares (LTS) method
(see, e.g., Rousseeuw and Leroy 1987) when $G = 1$ and $\alpha > 0$. It protects against
large values of $(y_i - b_g' x_i - b_g^0)^2$, hence it is useful to cope with many cases of data
contamination which would cause the parameters $b_g$ to “break down” in the absence of trimming.
However, it does not protect the model estimation from the effects of “bad” leverage
points, due to outliers in $x$. As happens in ordinary least squares regression, a few
bad leverage points can produce very disappointing results.
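For fixed parameter values, the maximization of (10) over the indicators $z(\cdot,\cdot)$ is attained by keeping the $[n(1-\alpha)]$ observations with the largest mixture densities. A minimal Python sketch of this evaluation step follows. The function name `trimmed_mr_objective` and its argument layout are hypothetical, $[\cdot]$ is read as the integer part, and the block is not the EM algorithm of the paper, only an evaluation of the target function under these assumptions.

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Univariate normal density with the given mean and variance."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def trimmed_mr_objective(x, y, pi, b, b0, sigma2, alpha):
    """Trimmed objective (10) and optimal 0-1 indicators for FIXED parameters.

    x: (n,) or (n, d) covariates; y: (n,) responses;
    pi, b0, sigma2: length-G sequences; b: (G, d) slopes; alpha: trimming level.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)        # n x d
    b = np.asarray(b, dtype=float).reshape(len(pi), -1)       # G x d
    means = x @ b.T + np.asarray(b0, dtype=float)             # n x G regression means
    comp = normal_pdf(y[:, None], means, np.asarray(sigma2, dtype=float))
    dens = (comp * np.asarray(pi, dtype=float)).sum(axis=1)   # mixture density per point
    n_keep = int(np.floor(len(y) * (1.0 - alpha) + 1e-9))     # [n(1 - alpha)]
    keep = np.argsort(dens)[::-1][:n_keep]                    # largest densities are kept
    z = np.zeros(len(y), dtype=int)
    z[keep] = 1
    return float(np.sum(np.log(dens[keep]))), z

# Toy data (hypothetical): nine points on y = 2x + 1 and one vertical outlier.
x = np.arange(10.0)
y = 2.0 * x + 1.0
y[9] = 100.0
value, z = trimmed_mr_objective(x, y, pi=[1.0], b=[[2.0]], b0=[1.0], sigma2=[1.0], alpha=0.1)
print(value, z)
```

With $G = 1$ the retained observations are those with the smallest squared residuals, which is the LTS connection mentioned above. Note that the selection only looks at the conditional density of $y$ given $x$, so a point that is remote in $x$ but close to a regression line is never flagged.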
For instance, consider the simulated datasets Simdata7 and Simdata8 in Figure
7. Both datasets are made of 180 observations drawn from a CWM with two groups
and of 20 noisy observations generated by two different contamination mechanisms.
The leftmost panels in Figure 7, (a) and (d), show the results of fitting the standard
CWM; the central panels (b) and (e) concern trimmed Mixture of Regressions
($\alpha = 0.1$) and, finally, the rightmost panels (c) and (f) illustrate the proposed trimmed
CWRM ($\alpha = 0.1$). We can see that the fit of the standard (untrimmed) CWM is
strongly affected by the contamination. Trimmed Mixtures of Regressions are able to
resist the type of contamination in (b) but cannot cope with outliers acting as bad leverage
points, as in (e). On the other hand, the use of trimmed CWRM, as shown in (c)
and (f), resists both types of contamination. To avoid an unfair comparison, we have
not included remarkable differences in the $X$ distributions for the two main groups
(i.e., prior to contamination), but we can see in Figure 1 how the trimmed CWRM is
able to deal with components having different marginal distributions.



Fig. 7 Simdata7 in the upper panels (a)-(c) and Simdata8 in the lower panels (d)-(f), both including
10% contamination. (a) and (d) fitting the (untrimmed) CWM; (b) and (e) fitting trimmed Mixture of
Regressions; (c) and (f) applying trimmed CWRM. In particular, $\alpha = 0.1$ and $c_\varepsilon = 20$ are used in
(b), (c), (e) and (f), while $c_X = 20$ is used in (c) and (f).


The problem of leverage points has been addressed in Robust Regression by
down-weighting influential observations as, for instance, GM-estimators do (Krasker and Welsch
1992). In the context of clusterwise regression, García-Escudero et al. (2010) proposed
a “second trimming”, by fixing two trimming parameters $\alpha_1$ and $\alpha_2$. Parameter
$\alpha_1$ controls the effect of outliers corresponding to large values of $(y_i - b_g' x_i - b_g^0)^2$,
while $\alpha_2$ aims at controlling leverage points corresponding to outlying values on $x$.
However, the distinction between these two types of outliers is not always so clear. On
the other hand, the trimmed CWRM provides a unified handling that simultaneously
deals with both types of outliers. As the probability of belonging to a cluster
is not a fixed value, $\pi_g$, but depends also on the CWM weight $\phi_d(x_i; \mu_g, \Sigma_g)\pi_g$,
trimming acts first on points that lie on the equiprobability contours (i.e.
sets of points where the p.d.f. of the mixture takes a constant value) farthest from the cluster
means. We are assuming that outliers are the points $(x_i, y_i)$ with lower values of
$D(x_i, y_i; \theta)$, rather than points with greater vertical distances $(y_i - b_g' x_i - b_g^0)^2$.
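The trimming rule just described can be sketched as follows. We assume here, as a reading of the notation above, that $D(x, y; \theta) = \sum_{g} \pi_g \phi_d(x; \mu_g, \Sigma_g)\phi(y; b_g' x + b_g^0, \sigma_g^2)$; the function names and the toy data are hypothetical. The example places one point far away in $x$ but exactly on the regression line, so that its vertical distance is zero while its $D$ value is the lowest in the sample.

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """Density of N_d(mu, cov) evaluated at the rows of x (n x d)."""
    d = len(mu)
    diff = x - mu
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (maha + logdet + d * np.log(2.0 * np.pi)))

def cwm_density(x, y, pi, mu, cov, b, b0, sigma2):
    """Assumed form of D(x, y; theta): sum_g pi_g * phi_d(x) * phi(y | x)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float).reshape(len(y), -1)
    total = np.zeros(len(y))
    for g in range(len(pi)):
        mean_y = x @ np.asarray(b[g], dtype=float) + b0[g]
        phi_y = np.exp(-0.5 * (y - mean_y) ** 2 / sigma2[g]) / np.sqrt(2.0 * np.pi * sigma2[g])
        phi_x = mvn_pdf(x, np.asarray(mu[g], dtype=float), np.asarray(cov[g], dtype=float))
        total += pi[g] * phi_x * phi_y
    return total

def trim_lowest(dens, alpha):
    """Indices of the observations discarded when keeping the [n(1 - alpha)] largest D values."""
    n_trim = len(dens) - int(np.floor(len(dens) * (1.0 - alpha) + 1e-9))
    return np.argsort(dens)[:n_trim]

# Toy data (hypothetical): one group around x = 0, plus a leverage point at x = 10
# lying exactly on the line y = x.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 10.0])
y = x.copy()
y[:9] += 0.5 * np.array([1, -1, 1, -1, 1, -1, 1, -1, 1])
dens = cwm_density(x, y, pi=[1.0], mu=[[0.0]], cov=[[[1.0]]],
                   b=[[1.0]], b0=[0.0], sigma2=[1.0])
trimmed = trim_lowest(dens, alpha=0.1)
residuals2 = (y - x) ** 2
print(trimmed, residuals2[trimmed])
```

In this toy sample the single trimmed observation is the leverage point, although its squared vertical distance is zero. A rule based only on $(y_i - b_g' x_i - b_g^0)^2$ would have kept it.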

Other alternatives to guard CWM against contamination are based on the consideration
of $t$-distributions, instead of normal ones, see Ingrassia et al. (2012). They
provide a clear robustness gain with respect to the Gaussian CWM. However, without
trimming, one single observation placed in a very remote position can still be very
harmful. In fact, we can make some components of $b_g$ arbitrarily large or small,
just by moving one single observation. A small positive fraction of pointwise contamination
can be very dangerous too, even when it is not distant from the data. On the
other hand, the trimmed CWRM is more resistant to extreme contamination, because
it does not make any assumption about how outliers have been generated. Therefore,
rather structured sources of outliers (clearly not generated from a $t$-distribution)
can be handled, too.

Several methods aimed at robustifying the Mixtures of Regressions model can also
be found in the literature. Apart from those based on trimming that have been previously
cited, methods based on M-estimation have been proposed in Bai et al. (2012)
and methods extending S-estimation in Bashir and Carter (2012). Song et al. (2014) propose
to model the error terms by a Laplace distribution, while Yao et al. (2014) suggest
employing the $t$ distribution. Although all these methods improve the robustness of the
model, they do not model the marginal $X$ distribution. Therefore, they do not take
advantage of this information to detect the different mixture components and hence
are not able to cope with outliers both on $x$ and on $y$, acting as bad leverage points. To
overcome this issue, Yao et al. (2014) have recently proposed applying their robust
Mixture of Regressions after using a trimming procedure (with high breakdown point)
which removes clear outliers on $x$. This initial trimming is unfortunately done without
considering the $Y$ variable, nor the joint distribution in $(X, Y)$, corresponding
to the different mixture components. The MCD estimator, considered for this initial
trimming, is aimed at working on a single contaminated population and can be troublesome
for detecting outliers when the data set includes different subpopulations.

In most applications, the true contamination level is unknown. Therefore,
it makes sense to consider a preventive (higher than needed) trimming level $\alpha$. This
could lead to wrongly trimmed observations, but the “cores” of the clusters and sensible
approximations of the regression lines are correctly found most of the time.
Starting from them, it is not difficult to recover wrongly trimmed observations, by
resorting to Mahalanobis distances and diagnostic regression tools (see Section 7 in
García-Escudero et al. 2010).

5 Real data examples

5.1 Tone data 


This data set comes from an experiment in music perception introduced in Cohen
(1984), which has been analyzed in many papers concerning Mixtures of Regressions
(see, e.g., de Veaux) and their robust versions (Schlittgen 2011; Hennig 2002;
Bai et al. 2012; Bashir and Carter 2012; Song et al. 2014; Yao et al. 2014). This
data set is shown in Figure 8(a) and the result of applying the trimmed CWRM in
(b). We can see that the two main groups (interval memory judgement and partial
matching) can be detected by applying the trimmed CWRM. Furthermore, $\alpha = 0.05$
allows us to detect a fraction of outlying observations, within the partial matching group,
exhibiting a clearly different behavior.



Fig. 8 Tone data: (a) Data set; (b) Trimmed CWRM fitting with $\alpha = 0.05$ and $c_X = c_\varepsilon = 20$.


The outliers included in this data set are not very harmful and, thus, no dramatic
differences can be expected in terms of the estimated parameters, when using
any (robust) Mixture of Regressions approach. So, we will proceed to artificially contaminate
the data and use it as a benchmark for the effects of leverage points added
through pointwise contamination. This has already been done by Bai et al. (2012),
who introduced 6% contamination at (0,4), when applying an M-estimation approach.
In our case, we will use a more complete contamination scheme by adding
9% of point contamination, placed around points (2.5,5), (6,4), (0,0.5) and (5,2.5),
successively. The first location, (2.5,5), is a regression outlier, while the remaining
three are leverage points.

Table 1 summarizes the performance of the proposed trimmed CWRM and the
trimmed Mixture of Regressions (trimmed MR) presented in Section 4.2, both with
an $\alpha = 0.1$ trimming level, for different values of the constraint factors $c_X$ and
$c_\varepsilon$, and labeling by “Yes”/“No” the cases in which the trimming level allows/does
not allow all the noisy observations to be discarded. We can see that only the use of the
trimmed CWRM with $\alpha = 0.1$ and with both constants fixed at their most restrictive
values is able to cope with the contamination in all the considered scenarios.

Contamination   Trimmed CWRM                        Discarded   Trimmed MR                   Discarded
location        constants                           outliers    constants                    outliers

(2.5,5)         $c_X = c_\varepsilon = 1$           Yes         $c_\varepsilon = 1$          Yes
                $c_X = c_\varepsilon = 10^{?}$      No          $c_\varepsilon = 10^{?}$     Yes
                $c_X = c_\varepsilon = 10^{10}$     No          $c_\varepsilon = 10^{10}$    No

(6,4)           $c_X = c_\varepsilon = 1$           Yes         $c_\varepsilon = 1$          No
                $c_X = c_\varepsilon = 10^{?}$      No          $c_\varepsilon = 10^{?}$     No
                $c_X = c_\varepsilon = 10^{10}$     No          $c_\varepsilon = 10^{10}$    No

(0,0.5)         $c_X = c_\varepsilon = 1$           Yes         $c_\varepsilon = 1$          No
                $c_X = c_\varepsilon = 10^{?}$      Yes         $c_\varepsilon = 10^{?}$     No
                $c_X = c_\varepsilon = 10^{10}$     No          $c_\varepsilon = 10^{10}$    No

(5,2.5)         $c_X = c_\varepsilon = 1$           Yes         $c_\varepsilon = 1$          No
                $c_X = c_\varepsilon = 10^{?}$      No          $c_\varepsilon = 10^{?}$     No
                $c_X = c_\varepsilon = 10^{10}$     No          $c_\varepsilon = 10^{10}$    No

Table 1 Tone data: Performance comparison between the trimmed CWRM methodology and trimmed
Mixture of Regressions (trimmed MR) with an $\alpha = 0.1$ trimming level.


5.2 Students’ heights and weights 

The data set in this example is based on students' answers to a questionnaire including
simple questions about anthropometric measurements. Due to the way in which the
dataset has been collected, it contains outliers, as some students did not answer the
questions seriously, or misinterpreted the measurement units, etc. Here,
we focus on the relationship between two variables in the data set, namely “Height”
($X$) in cm and “Weight” ($Y$) in kg. Although gender was also considered in the study,
we will ignore it, to test the ability of our methodology to classify the individuals and
to estimate the two underlying regression models, one for each gender, in the presence of
a substantial amount of severe outliers.

Figure 9(a) shows the original data set (which will be referred to as Student data)
with the true gender assignments, while in (b) we have eliminated the points corresponding
to a wrong scale in height (students reporting height in meters instead of
centimeters), to emphasize the different linear patterns. Several implausible weight
values can also be seen. Figure 9(c) shows the results corresponding to the fit of the
CWM (when $\alpha = 0$ and $c_X = c_\varepsilon = 10^{10}$, i.e., no trimming and almost unrestricted).
We can see that one of the regression lines is capturing the artificial, almost
collinear group having anomalous height values. Consequently, the main groups are joined
together and the classification error rate is very high. On the other hand, Figure 9(e)
shows the result of applying the trimmed CWRM with $\alpha = 0.1$ and moderate values
of the constraints. The restrictions now prevent the method from falling into the previously
obtained spurious solution, generated by the almost collinear outliers (wrong measurement
units), and these points are trimmed off, together with other data points
exhibiting atypical weight values. The classification error rate for untrimmed observations
is just 12%. Figures 9(d) and (f) show the data set after eliminating the points
with wrong units for the height. In Figure 9(d), we can see that using the CWM,
even in this cleaned data set, again fails to detect the true groups. On the contrary,
we can see in (f) that the trimmed CWRM with $\alpha = 0.04$ and moderate values of
$c_X$ and $c_\varepsilon$ provides sensible results. It is true that simple visual inspection could have
served to “clean” this data set, but this is surely not the case when dealing with more
complex/high dimensional data sets or when carrying out fully unsupervised data
analyses.

Fig. 9 Student data: (a) “Students’ heights and weights” data; (b) cleaned data set obtained by deleting
the outliers due to wrong measurement scale for “height”. Effects of trimming and restrictions on CWRM
results: (c) untrimmed and almost unrestricted: $\alpha = 0$ and $c_X = c_\varepsilon = 10^{10}$; (d) untrimmed and almost
unrestricted: $\alpha = 0$ and $c_X = c_\varepsilon = 10^{10}$ for the cleaned data set; (e) trimmed and constrained: $\alpha = 0.1$
and $c_X = c_\varepsilon = 20$; (f) trimmed and constrained: $\alpha = 0.04$ and $c_X = c_\varepsilon = 20$ for the cleaned data set.


6 Concluding remarks 


The present work is centered on the wide family of Gaussian CWMs, which have received
growing attention in the recent literature. However, as happens for many other
models which depend on normal assumptions, the ML estimation for CWM suffers
from a lack of robustness. Moreover, the problem statement in terms of likelihood
maximization is not well-posed without constraints. Hence, here we have presented
a new estimation framework for the linear Gaussian CWM based on trimming and
constraints, to achieve robustness, identify and discard outliers, circumvent the likelihood
singularities and reduce the detection of spurious solutions.

Numerical studies, based on both simulated and real data, show that the new
proposal drives the estimation procedure to discard even strongly concentrated contaminating
observations, acting as bad leverage points, which are so harmful in the
framework of Mixtures of Regressions. Apart from the effectiveness of the proposed
methodology in resisting any kind of outliers, we have also shown that a theoretically
well defined mathematical and statistical problem underlies it. The existence of optima
for both the population and the sample problem has been established, and the
consistency of the sample solution to the population one has been proved.

Further research could be focused on tuning the choice of the involved parameters.
This is a complex task, as these parameters are clearly interrelated. For instance,
a high trimming level $\alpha$ could lead to smaller $G$ values, since components
with fewer observations may be trimmed off. Moreover, larger values of $c_X$ and $c_\varepsilon$
could lead to higher values of $G$, since more components with few observations, but
close to collinearity, may be detected. Our suggestion is that the researcher must
provide in advance part of these parameters (as a way of specifying the type of
clusters expected from the data) and, then, some data-dependent diagnostic can be
used to make appropriate choices for the rest of the parameters. The use of trimmed
BIC notions (Neykov et al. 2007) or the adaptation of some graphical tools, as in
García-Escudero et al. (2011), can be useful for this purpose.

Appendix

The following section is organized into four parts: part A contains technical lemmas
useful for the proof of the existence of the maximizer $\theta_0$ for $L(\theta, P)$ (Proposition
3.2.1), which is established in part B; part C shows preliminary results needed to
show the consistency of the sample estimator for $\theta_0$ (Proposition 3.2.2), which is then
proved in part D.

Part A: Preliminary results in view of Proposition 3.2.1


Four technical lemmas will be needed before attacking the proof of Proposition 3.2.1.

First of all, let us remark that, given the definition of $L(\theta, P)$, there exist sequences
$\{\theta_n\}_{n=1}^{\infty}$ with

$$\theta_n = \left(\pi_1^n, \ldots, \pi_G^n, \mu_1^n, \ldots, \mu_G^n, \Sigma_1^n, \ldots, \Sigma_G^n, b_1^{0,n}, \ldots, b_G^{0,n}, b_1^n, \ldots, b_G^n, \sigma_1^{2,n}, \ldots, \sigma_G^{2,n}\right) \tag{11}$$

and $\theta_n \in \Theta_{c_X, c_\varepsilon}$ and such that

$$\lim_{n \to \infty} L(\theta_n, P) = \sup_{\theta \in \Theta_{c_X, c_\varepsilon}} L(\theta, P) > -\infty \tag{12}$$

(the boundedness from below is obtained just by considering the set $A$ as being a
ball centered at $(0, 0)$ with $P[A] \ge 1 - \alpha$, $\pi_1 = 1$, $\mu_1 = 0$, $\Sigma_1 = I_d$, $b_1^0 = 0$ and
$b_1 = 0$).

The proof of the existence will be done by proving that we can obtain a convergent
subsequence extracted from $\{\theta_n\}_{n=1}^{\infty}$ satisfying (12), and whose limit $\theta_0$ is optimal
for $P$.

Let us begin with Lemma 1, which provides a uniformly bounded representation
of the regression coefficients, even in case of local collinearity, without losing their
properties in the evaluation of the target function.


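Lemma 1 Let $\{b_n^0\}_{n=1}^{\infty}$ be a sequence in $\mathbb{R}$, $\{b_n\}_{n=1}^{\infty}$ be a sequence in $\mathbb{R}^d$ and
$\{A_n\}_{n=1}^{\infty}$ be a sequence of sets in $\mathbb{R}^{d+1}$ verifying

$$\limsup_n P[A_n] > 0 \tag{13}$$

and such that

$$\limsup_n E_P\left[|b_n^0 + b_n' X - Y|^2 I_{A_n}(X, Y)\right] < \infty. \tag{14}$$

Then, we can extract subsequences $\{b_{n_k}^0\}_{k=1}^{\infty}$, $\{b_{n_k}\}_{k=1}^{\infty}$ and $\{A_{n_k}\}_{k=1}^{\infty}$ from them
and define new sequences $\{d_k^0\}_{k=1}^{\infty}$, $\{d_k\}_{k=1}^{\infty}$ and $\{D_k\}_{k=1}^{\infty}$ which satisfy $D_k \subseteq A_{n_k}$,
$P[A_{n_k} \setminus D_k] \to 0$, $d_k^0 \to d^0 \in \mathbb{R}$, $d_k \to d \in \mathbb{R}^d$ and such that

$$(b_{n_k}^0 + b_{n_k}' X - Y) I_{D_k}(X, Y) = (d_k^0 + d_k' X - Y) I_{D_k}(X, Y), \quad P\text{-a.s.}, \tag{15}$$

for every $k \ge 1$.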

Proof: To simplify the proof, w.l.o.g., we will use the same notation for the subsequences
as that used for the original sequences. If the sequences $\{b_n^0\}_{n=1}^{\infty}$ and
$\{b_n\}_{n=1}^{\infty}$ are bounded, then we just need to extract convergent subsequences and set
$D_n = A_n$. So, let us assume that either one or both sequences are unbounded, and
consider a sequence of compact sets $\{K_n\}_{n=1}^{\infty}$ such that $K_n \uparrow \mathbb{R}^{d+1}$. Let $\{v_{n,l}\}_{l=1}^{d}$ be
the normalized eigenvectors obtained from the spectral decomposition of the matrices
$\{\mathrm{Var}_P[X / A_n \cap K_n]\}_{n=1}^{\infty}$ (we use $E_P[\cdot / A]$ and $\mathrm{Var}_P[\cdot / A]$ for denoting
$E_P[\cdot / (X, Y) \in A]$ and $\mathrm{Var}_P[\cdot / (X, Y) \in A]$).

Now, let us suppose that there exists a direction $v_{n,l}$ such that $\mathrm{Var}_P[v_{n,l}' X / A_n \cap K_n] \to 0$;
then take $H$ with $0 \le H < d$ and such that $\mathrm{Var}_P[v_{n,l}' X / A_n \cap K_n] \to 0$
for every $l \ge H + 1$, after a possible reordering of the coordinates. In this case,
there also exist points $u_{n,l}$ in $\mathbb{R}^d$ and a sequence $\delta_n \downarrow 0$ which must satisfy
$P[|v_{n,l}'(X - u_{n,l})| > \delta_n / A_n \cap K_n] \to 0$ for every $l \ge H + 1$. The $v_{n,l}$ are bounded
(unitary vectors) and the $u_{n,l}$ must be bounded too (because, otherwise, $X$ would not
be tight). Therefore, there exist subsequences, that will be denoted as the original
ones, such that $v_{n,l} \to v_l \in \mathbb{R}^d$, $u_{n,l} \to u_l \in \mathbb{R}^d$ and $P[|v_l'(X - u_l)| > 0 / A_n \cap K_n] \to 0$
for every $l \ge H + 1$.

Let us now define $D_n = A_n \cap K_n \cap \bigcap_{l=H+1}^{d} \{v_l'(X - u_l) = 0\}$, which trivially
verifies $D_n \subset A_n$ and that $P[A_n \setminus D_n] \to 0$. We can rewrite

$$b_n^0 + b_n' x = b_n^0 + \sum_{l=1}^{H} b_n' v_l v_l' x + \sum_{l=H+1}^{d} b_n' v_l v_l' x$$

and set $d_n^0 = b_n^0 + \sum_{l=H+1}^{d} b_n' v_l v_l' u_l$ and $d_n = \sum_{l=1}^{H} (b_n' v_l) v_l$ for $H > 0$ (while we set
$d_n = 0$ when $H = 0$). Then (15) trivially holds and it can be shown that $\{d_n^0\}_{n=1}^{\infty}$
and $\{d_n\}_{n=1}^{\infty}$ are bounded sequences. This follows from the fact that (14) guarantees
that $\{(b_n^0 + b_n' X - Y) I_{D_n}(X, Y)\}_{n=1}^{\infty}$ is a tight sequence. Notice that we could see
that the previous tightness property would be contradicted if any of the $\{d_n^0\}_{n=1}^{\infty}$ or $\{d_n\}_{n=1}^{\infty}$
were unbounded by seeing that $Z = (Z_1, \ldots, Z_H)$ with $Z_l = v_l' X$ satisfies
$\det(\mathrm{Var}_P[Z / A_n \cap K_n]) > 0$ and $d_n' x = \sum_{l=1}^{H} (b_n' v_l) v_l' x$.

Finally, whenever none of the sequences $\mathrm{Var}_P[v_{n,l}' X / A_n \cap K_n]$ converges to 0,
we can consider the representation $b_n^0 + b_n' x = b_n^0 + \sum_{l=1}^{d} b_n' v_l v_l' x$ and the result
would be proven in this case, too, following similar arguments as before. □

The following Lemma 2 assures that, under the usual assumption on $P$, the associated
fitted trimmed CWMs could not be arbitrarily close to a degenerate model
concentrated on $G$ points, nor on $G$ regression hyperplanes.


Lemma 2 Let $P$ be a distribution in $\mathbb{R}^{d+1}$ satisfying (PR):

(a) For every $b_g^0 \in \mathbb{R}$, $b_g \in \mathbb{R}^d$ and $A \subset \mathbb{R}^{d+1}$ with $P[A] = 1 - \alpha$, there exists $\delta > 0$
such that

$$E_P\left[\min_{g=1,\ldots,G} |b_g^0 + b_g' X - Y|^2 I_A(X, Y)\right] \ge \delta.$$

(b) For every set of $G$ points $\{\mu_1, \ldots, \mu_G\} \subset \mathbb{R}^d$ and $A \subset \mathbb{R}^{d+1}$ with $P[A] = 1 - \alpha$,
there exists $\delta > 0$ such that

$$E_P\left[\min_{g=1,\ldots,G} \|X - \mu_g\|^2 I_A(X, Y)\right] \ge \delta.$$

Proof of (a): Let us suppose that $\delta$ does not exist. Then, we can choose sequences
$\{A_n\}_{n=1}^{\infty}$, $\{b_g^{0,n}\}_{n=1}^{\infty}$ and $\{b_g^n\}_{n=1}^{\infty}$ such that

$$E_P\left[\min_{g=1,\ldots,G} |b_g^{0,n} + (b_g^n)' X - Y|^2 I_{A_n}(X, Y)\right] \to 0 \quad \text{with } P[A_n] \to 1 - \alpha. \tag{16}$$

Moreover, we can replace the sets $A_n$ in (16) by the sets

$$A_n^* = \left\{(x, y) : \min_{g=1,\ldots,G} |b_g^{0,n} + (b_g^n)' x - y|^2 \le \min\{r_n^2, \varepsilon\}\right\},$$

where $r_n^2 = \inf_u \{P[(x, y) : \min_{g=1,\ldots,G} |b_g^{0,n} + (b_g^n)' x - y|^2 \le u] \ge 1 - \alpha\}$ and
we also have the same convergence as in (16) with $P[A_n^*] \to 1 - \alpha$ for any fixed
choice of $\varepsilon > 0$. Then, take

$$A_g^n = \left\{(x, y) \in A_n^* : |b_g^{0,n} + (b_g^n)' x - y| = \min_{j=1,\ldots,G} |b_j^{0,n} + (b_j^n)' x - y|\right\}$$

and we can see that there exists at least one $g$ such that $P[A_g^n] \to p_g > 0$ through
a subsequence (because $P[A_n^*] = \sum_{g=1}^{G} P[A_g^n] \to 1 - \alpha$). Thus, consider a
reordering of $\{1, \ldots, G\}$ such that $P[A_g^n] \to p_g > 0$ for every $g \in \{1, \ldots, H\}$ (for an
appropriate subsequence, if needed). If $A_n^{**} = \bigcup_{g=1}^{H} A_g^n$, then

$$E_P\left[\min_{g=1,\ldots,G} |b_g^{0,n} + (b_g^n)' X - Y|^2 I_{A_n^{**}}(X, Y)\right] = \sum_{g=1}^{H} E_P\left[|b_g^{0,n} + (b_g^n)' X - Y|^2 I_{A_g^n}(X, Y)\right]$$

and $P[A_n^{**}] \to 1 - \alpha$. For every $g \in \{1, \ldots, H\}$, the $A_g^n$, $b_g^{0,n}$ and $b_g^n$ satisfy the
conditions needed to apply Lemma 1 and, therefore, we can replace them by $D_g^n$, $d_g^{0,n}$
and $d_g^n$ satisfying $D_g^n \subset A_g^n$, $P[A_g^n \setminus D_g^n] \to 0$, $d_g^{0,n} \to d_g^0 \in \mathbb{R}$ and $d_g^n \to d_g \in \mathbb{R}^d$,
and (15).

Now, take $B_n = \bigcup_{g=1,\ldots,H} D_g^n \cap \{(x, y) : \min_{g=1,\ldots,H} |d_g^{0,n} + (d_g^n)' x - y|^2 \le \varepsilon\}$
for a fixed $\varepsilon$, with $P[B_n] \to 1 - \alpha$. We thus have the pointwise convergence

$$\min_{g=1,\ldots,H} |d_g^{0,n} + (d_g^n)' x - y|^2 I_{B_n}(x, y) \to \min_{g=1,\ldots,H} |d_g^0 + d_g' x - y|^2 I_{B_0}(x, y),$$

for any $B_0 \subset \mathbb{R}^{d+1}$ with $P[B_0] = 1 - \alpha$, and the uniform bound
$\min_{g=1,\ldots,H} |d_g^{0,n} + (d_g^n)' x - y|^2 I_{B_n}(x, y) \le \varepsilon$. Then, the dominated convergence theorem implies

$$E_P\left[\min_{g=1,\ldots,H} |d_g^{0,n} + (d_g^n)' X - Y|^2 I_{B_n}(X, Y)\right] \to E_P\left[\min_{g=1,\ldots,H} |d_g^0 + d_g' X - Y|^2 I_{B_0}(X, Y)\right].$$

The latter convergence and (16) would prove that

$$E_P\left[\min_{g=1,\ldots,H} |d_g^0 + d_g' X - Y|^2 I_{B_0}(X, Y)\right] = 0,$$

implying that the distribution $P$ is concentrated on $G$ regression hyperplanes after
removing a proportion $\alpha$ of the probability mass and this would contradict (PR).

Proof of (b): The proof of this result mimics the steps followed in the proof of (a).
We start by assuming the existence of sequences $\{A_n\}_{n=1}^{\infty}$ and $\{\mu_g^n\}_{n=1}^{\infty}$ such
that

$$E_P\left[\min_{g=1,\ldots,G} \|X - \mu_g^n\|^2 I_{A_n}(X, Y)\right] \to 0 \quad \text{with } P[A_n] \to 1 - \alpha,$$

and we would end up by seeing that the support of $X$ is concentrated on $G$ points in $\mathbb{R}^d$.
In fact, the proof is easier because only the tightness of $P$ is needed (Lemma 1 is no
longer required, here). □

Now, since $[0,1]^G$ is a compact set, we can trivially choose a subsequence of
$\{\theta_n\}_{n=1}^{\infty}$ such that $\pi_g^n \to \pi_g \in [0,1]$ for $1 \le g \le G$. With respect to the scatter
matrices and the variances of the error terms, we have the following possibilities:

(S1) $\Sigma_g^n \to \Sigma_g$ for $1 \le g \le G$ with $\Sigma_g$ being p.s.d. matrices
(S2) $\min_{g=1,\ldots,G} \min_{l=1,\ldots,d} \lambda_l(\Sigma_g^n) \to \infty$
(S3) $\max_{g=1,\ldots,G} \max_{l=1,\ldots,d} \lambda_l(\Sigma_g^n) \to 0$

(V1) $\sigma_g^{2,n} \to \sigma_g^2$ for $1 \le g \le G$ with $\sigma_g^2 > 0$
(V2) $\min_{g=1,\ldots,G} \sigma_g^{2,n} \to \infty$
(V3) $\max_{g=1,\ldots,G} \sigma_g^{2,n} \to 0$

Given that $\theta_n \in \Theta_{c_X, c_\varepsilon}$, only one of the convergences in S1-S3 and only one in V1-V3
are possible, and the following Lemma 3 will further restrict them to the bounded
cases, based on constraints (5) and (6).


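Lemma 3 If $\{\theta_n\}_{n=1}^{\infty} \subset \Theta_{c_X, c_\varepsilon}$ converges toward the supremum of $L(\theta, P)$, and
(PR) holds for $P$, then only convergences (S1) and (V1) are possible.

Proof: We have that $L(\theta_n; P)$ can be bounded from above by

$$-\frac{1}{2}\left[\log\Big(\min_g \sigma_g^{2,n}\Big) P[A(\theta_n)] + \frac{E_P\big[\min_g |b_g^{0,n} + (b_g^n)' X - Y|^2 I_{A(\theta_n)}(X, Y)\big]}{\max_g \sigma_g^{2,n}}\right] - \frac{1}{2}\left[\log\Big(\min_g \min_l \lambda_l(\Sigma_g^n)\Big) P[A(\theta_n)]\, d + \frac{E_P\big[\min_g \|X - \mu_g^n\|^2 I_{A(\theta_n)}(X, Y)\big]}{\max_g \max_l \lambda_l(\Sigma_g^n)}\right] + C,$$

where $C$ is a constant value, not depending on $\theta_n$.

Therefore, given that $\theta_n \in \Theta_{c_X, c_\varepsilon}$, we see that the possible convergence of
$L(\theta_n; P)$ would clearly depend on those for the sequences

$$\log(\sigma_n^2)\, P[A(\theta_n)] + \frac{E_P\big[\min_g |b_g^{0,n} + (b_g^n)' X - Y|^2 I_{A(\theta_n)}(X, Y)\big]}{\sigma_n^2} \tag{17}$$

and

$$\log(\lambda_n)\, P[A(\theta_n)]\, d + \frac{E_P\big[\min_g \|X - \mu_g^n\|^2 I_{A(\theta_n)}(X, Y)\big]}{\lambda_n}, \tag{18}$$

where $\lambda_n = \max_{g=1,\ldots,G} \max_{l=1,\ldots,d} \lambda_l(\Sigma_g^n)$ and $\sigma_n^2 = \max_{g=1,\ldots,G} \sigma_g^{2,n}$.

On the other hand, Lemma 2 implies that a constant $\delta > 0$ can be chosen such that
$E_P[\min_g |b_g^{0,n} + (b_g^n)' X - Y|^2 I_{A_n}(X, Y)]$ and $E_P[\min_g \|X - \mu_g^n\|^2 I_{A_n}(X, Y)]$ in
(17) and (18) are uniformly bounded from below by $\delta$. Therefore, other convergences
different from (S1) or (V1) would imply that $\lim_{n \to \infty} L(\theta_n; P) = -\infty$ and this
would contradict (12). □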
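The upper bound used in the proof of Lemma 3 can be traced back to elementary bounds on the Gaussian densities. The following derivation is only a sketch under assumed definitions, which we take from the notation of the proofs rather than from an explicit statement in this part of the text: $D_g(x, y; \theta) = \pi_g \phi_d(x; \mu_g, \Sigma_g)\phi(y; b_g^0 + b_g' x, \sigma_g^2)$, $D = \sum_g D_g$ and $L(\theta, P) = E_P[\log D(X, Y; \theta) I_{A(\theta)}(X, Y)]$. The shorthand $\underline{\lambda}$, $\overline{\lambda}$, $\underline{\sigma}^2$, $\overline{\sigma}^2$ for the extreme eigenvalues and variances is introduced here for convenience only.

```latex
% Shorthand (introduced for this sketch only):
%   \underline{\lambda} = \min_g \min_l \lambda_l(\Sigma_g), \quad
%   \overline{\lambda}  = \max_g \max_l \lambda_l(\Sigma_g),
%   \underline{\sigma}^2 = \min_g \sigma_g^2, \quad
%   \overline{\sigma}^2  = \max_g \sigma_g^2 .
\begin{align*}
\phi_d(x; \mu_g, \Sigma_g)
  &\le (2\pi)^{-d/2}\, \underline{\lambda}^{-d/2}
     \exp\!\Big(-\frac{\|x - \mu_g\|^2}{2\,\overline{\lambda}}\Big), \\
\phi(y; b_g^0 + b_g' x, \sigma_g^2)
  &\le (2\pi)^{-1/2}\, (\underline{\sigma}^2)^{-1/2}
     \exp\!\Big(-\frac{|b_g^0 + b_g' x - y|^2}{2\,\overline{\sigma}^2}\Big), \\
\log D(x, y; \theta)
  &\le C_0 - \frac{d}{2}\log \underline{\lambda} - \frac{1}{2}\log \underline{\sigma}^2
     - \frac{\min_g \|x - \mu_g\|^2}{2\,\overline{\lambda}}
     - \frac{\min_g |b_g^0 + b_g' x - y|^2}{2\,\overline{\sigma}^2},
\end{align*}
% with C_0 = -\frac{d+1}{2}\log(2\pi); the last line uses \sum_g \pi_g = 1.
% Multiplying by I_{A(\theta)}(x, y) and taking expectations with respect to P
% gives a bound of the form displayed in the proof, the constant term being
% C_0 \, P[A(\theta)], which stays bounded.
```

The first inequality uses $\det(\Sigma_g) \ge \underline{\lambda}^d$ and $(x - \mu_g)'\Sigma_g^{-1}(x - \mu_g) \ge \|x - \mu_g\|^2 / \overline{\lambda}$. Under the constraints, $\underline{\lambda} \ge \overline{\lambda}/c_X$ and $\underline{\sigma}^2 \ge \overline{\sigma}^2/c_\varepsilon$, which is why only the sequences (17) and (18) matter.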

Lemma 4, stated below, shows that we can always find a subsequence $\{\theta_n\}_{n=1}^{\infty}$
with converging parameters for at least one mixture component, with weight $\pi_g^n$ converging
toward a strictly positive value.

Lemma 4 There exists a sequence $\{\theta_n\}_{n=1}^{\infty}$ converging toward the supremum of
$L(\theta, P)$ and there exists $H$ with $1 \le H \le G$ such that

$$\mu_g^n \to \mu_g, \quad b_g^{0,n} \to b_g^0, \quad b_g^n \to b_g \quad \text{and} \quad \pi_g^n \to \pi_g > 0 \quad \text{for every } g \le H$$

and such that the corresponding $\{A(\theta_n)\}_{n=1}^{\infty}$ sets are uniformly bounded.

Proof: Let us start from any $\{\theta_n\}_{n=1}^{\infty}$ converging toward the supremum of $L(\theta, P)$,
and take $A_n = A(\theta_n)$ and

$$A_g^n = \left\{(x, y) \in A_n : D_g(x, y; \theta_n) = \max_{j=1,\ldots,G} D_j(x, y; \theta_n)\right\}$$

for $1 \le g \le G$. Since $P[A_g^n] \in [0,1]$, there exists a subsequence, denoted as the
original one, such that each $P[A_g^n]$ converges for $1 \le g \le G$. Moreover, after a proper
reordering in the components of $\theta_n$, there exists $H^* \ge 1$ such that $P[A_g^n] \to p_g > 0$
for $1 \le g \le H^*$. Note that this $H^*$ does exist because otherwise we would have
$P[A_n] = \sum_{g=1}^{G} P[A_g^n] \to 0$.

We can also find a convergent subsequence of $\mu_g^n$ for every $g \le H^*$. Otherwise,
for every $\varepsilon$ with $0 < \varepsilon < p_g$, we could take a ball $B_\varepsilon$ centered at $(0, 0)$ with
$P[B_\varepsilon] > 1 - p_g + \varepsilon$ and such that there exists $n_0$ with $P[B_\varepsilon \cap A_g^n] > \varepsilon/2$ when $n \ge n_0$.
Consequently, we would have $E_P[\|X - \mu_g^n\|^2 I_{A_g^n}] \ge E_P[\|X - \mu_g^n\|^2 I_{B_\varepsilon \cap A_g^n}] \to \infty$,
which contradicts (12). Note that the contributions of the other terms to $L(\theta_n, P)$ are
controlled, because of Lemma 3.

From (12), we have $\limsup_n E_P[|b_g^{0,n} + (b_g^n)' X - Y|^2 I_{A_g^n}(X, Y)] < \infty$. This,
together with the fact that $\limsup_n P[A_g^n] = p_g > 0$ for $g \le H^*$, allows us to apply
again Lemma 1 to replace the $\{b_g^{0,n}\}$, $\{b_g^n\}$ and $\{A_g^n\}$ sequences by appropriate
convergent sequences $\{d_g^{0,n}\}$, $\{d_g^n\}$ and $\{D_g^n\}$. These convergences also trivially imply
that $\pi_g^n \to \pi_g > 0$ for $g \le H^*$.

Other $g$ values could also satisfy these convergences (through subsequences and
possible alternative representations). In this case, we consider $H \ge H^*$ such that all
the convergences in the statement of this Lemma hold for $g \le H$.

To see that the $\{A(\theta_n)\}_{n=1}^{\infty}$ are uniformly bounded, recall that $A(\theta_n) = \{(x, y) :
D(x, y; \theta_n) \ge R(\theta_n, P)\}$ and let us introduce

$$\tilde{R}(\theta_n, P) = \sup_u \left\{P\left[\max_{1 \le g \le H} D_g(X, Y; \theta_n) \ge u\right] \ge 1 - \alpha\right\}.$$

Given that $D(x, y; \theta_n) \ge \max_g D_g(x, y; \theta_n)$, we trivially have the bound $\tilde{R}(\theta_n, P) \le
R(\theta_n, P)$. Moreover, $\pi_g^n$, $\mu_g^n$, $\Sigma_g^n$, $b_g^{0,n}$, $b_g^n$, $\sigma_g^{2,n}$ are convergent sequences when $g \le
H$ and, then, we can also find a strictly positive constant $R_H$ satisfying $0 < R_H \le
\tilde{R}(\theta_n, P) \le R(\theta_n, P)$. The sets $B_n = \{(x, y) : \max_{g \le H} D_g(x, y; \theta_n) \ge R_H\}$ satisfy
that $A_n \subset B_n$ and all these $B_n$ sets are uniformly bounded just by taking into account
the uniform continuity of the set functions $\{(x, y) \mapsto \max_{g \le H} D_g(x, y; \theta_n)\}_{n=1}^{\infty}$
and that the parameters corresponding to the first $H$ groups in $\{\theta_n\}_{n=1}^{\infty}$ are uniformly
bounded. □

Having established these crucial findings, we are ready to prove the existence
result.

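Part B: Proof of Proposition 3.2.1


Let us start from a sequence $\{\theta_n\}_{n=1}^{\infty}$ converging toward the supremum of $L(\theta, P)$.
Thanks to Lemma 3, we know that there exists a subsequence of $\{\theta_n\}_{n=1}^{\infty}$ with
$\Sigma_g^n \to \Sigma_g$ and $\sigma_g^{2,n} \to \sigma_g^2$ for $1 \le g \le G$. Moreover, by applying Lemma 4, a
further subsequence (with a proper modification, if needed) can be obtained that also
verifies $\mu_g^n \to \mu_g$, $b_g^{0,n} \to b_g^0$, $b_g^n \to b_g$ and $\pi_g^n \to \pi_g$ with $\pi_g > 0$ for any $g$ with
$g \le H$ and $1 \le H \le G$. Let us assume that there exists some $g$ such that $\mu_g^n$ is
not bounded, or such that a bounded representation for $b_g^{0,n}$ and $b_g^n$ (in the sense that
$\limsup_n E_P[|b_g^{0,n} + (b_g^n)' X - Y|^2 I_{A_n}(X, Y)] = \infty$) does not exist. We will see
that we necessarily must have that $\pi_g^n \to 0$ and, consequently, the role played by
$\mu_g^n$, $b_g^{0,n}$ and $b_g^n$ is irrelevant, given that they do not modify the value taken by the
target function. Therefore, we could modify them by using other arbitrary convergent
parameter values (of course, satisfying the desired constraints) and the proof would
be done.

To prove that, let us consider

$$M_n = E_P\left[\left(\log\Big(\sum_{g=1}^{G} D_g(X, Y; \theta_n)\Big) - \log\Big(\sum_{g=1}^{H} D_g(X, Y; \theta_n)\Big)\right) I_{A_n}(X, Y)\right].$$

By considering the same $R_H > 0$ used in the proof of Lemma 4 and the fact that
$\log(1 + x) \le x$, we can see that

$$M_n \le \sum_{g=H+1}^{G} E_P\left[\frac{D_g(X, Y; \theta_n)}{R_H} I_{A_n}(X, Y)\right].$$

Then, it is trivial to see that $M_n \to 0$ when $\mu_g^n$ is not bounded or when no bounded
representation for $b_g^{0,n}$ and $b_g^n$ exists for any $g > H$. Consequently, if $\pi_g^n \to \pi_g > 0$
for any $g > H$ and $\theta^*$ is the limit of the subsequence
$\{(\pi_1^n, \ldots, \pi_H^n, \mu_1^n, \ldots, \mu_H^n, \Sigma_1^n, \ldots, \Sigma_H^n, b_1^{0,n}, \ldots, b_H^{0,n}, b_1^n, \ldots, b_H^n, \sigma_1^{2,n}, \ldots, \sigma_H^{2,n})\}_{n=1}^{\infty}$,
we would have that $\limsup_{n \to \infty} L(\theta_n; P) = L(\theta^*; P)$ (because $M_n \to 0$) with
$\sum_{g=1}^{H} \pi_g < 1$. Then, we could define a new sequence

$$\{\tilde{\theta}_n\}_{n=1}^{\infty} = \left\{\left(\tilde{\pi}_1^n, \ldots, \tilde{\pi}_G^n, \tilde{\mu}_1^n, \ldots, \tilde{\mu}_G^n, \tilde{\Sigma}_1^n, \ldots, \tilde{\Sigma}_G^n, \tilde{b}_1^{0,n}, \ldots, \tilde{b}_G^{0,n}, \tilde{b}_1^n, \ldots, \tilde{b}_G^n, \tilde{\sigma}_1^{2,n}, \ldots, \tilde{\sigma}_G^{2,n}\right)\right\}_{n=1}^{\infty}$$

with

$$\tilde{\pi}_g^n = \frac{\pi_g^n}{\sum_{j=1}^{H} \pi_j^n} \ \text{ for } 1 \le g \le H \quad \text{and} \quad \tilde{\pi}_{H+1}^n = \cdots = \tilde{\pi}_G^n = 0,$$

with $\tilde{\mu}_g^n = \mu_g^n$, $\tilde{\Sigma}_g^n = \Sigma_g^n$, $\tilde{b}_g^{0,n} = b_g^{0,n}$, $\tilde{b}_g^n = b_g^n$ and $\tilde{\sigma}_g^{2,n} = \sigma_g^{2,n}$ for $1 \le g \le H$,
and parameters arbitrarily chosen when $g > H$ (only satisfying the required constraints).
We finally could see that $\limsup_{n \to \infty} L(\theta_n; P) < \limsup_{n \to \infty} L(\tilde{\theta}_n; P)$
and this would contradict the optimality stated in the hypothesis of the present lemma.

□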


Part C: Preliminary results in view of Proposition 3.2.2

Before starting the proof of the consistency of the solution for the sample problem to
the population solution, we introduce some notation, and state some useful results.

Let

$$\{\theta_n\}_{n=1}^{\infty} = \left\{\left(\pi_1^n, \ldots, \pi_G^n, \mu_1^n, \ldots, \mu_G^n, \Sigma_1^n, \ldots, \Sigma_G^n, b_1^{0,n}, \ldots, b_G^{0,n}, b_1^n, \ldots, b_G^n, \sigma_1^{2,n}, \ldots, \sigma_G^{2,n}\right)\right\}_{n=1}^{\infty} \subset \Theta_{c_X, c_\varepsilon}$$

denote a sequence of empirical estimators obtained by solving
the empirical problems defined from the sequence of empirical measures $\{P_n\}_{n=1}^{\infty}$.

First, we prove that there exists a compact set $K \subset \Theta_{c_X, c_\varepsilon}$ such that $\theta_n \in K$ with
probability 1. This is done through Lemmas 5 and 6, whose proofs are quite straightforward
adaptations of the previously given proofs of Lemmas 1, 2, 3 and 4. In those
adaptations, appropriate Glivenko-Cantelli classes of functions must be considered and
the class of balls in $\mathbb{R}^{d+1}$ (which is a Glivenko-Cantelli class too) is taken to provide
bounding compact sets when needed.

Lemma 5 If $P$ satisfies (PR), then only convergences (S1) and (V1) are possible for
the $\Sigma_g^n$'s and $\sigma_g^{2,n}$'s.


Lemma 6 If (PR) holds, then we can choose a sequence $\{\theta_n\}_{n=1}^{\infty}$ solving the empirical
problem with components $\mu_g^n$, $b_g^{0,n}$ and $b_g^n$ such that their norms are uniformly
bounded.


The following two lemmas are analogous to Lemmas 5 and 6 in García-Escudero et al.
(2014b). Their proofs mimic the same steps, with the only reformulation of the
$D(\cdot; \theta)$ functions, which here take into account the conditional distribution of the
$Y$ variable.

Lemma 7 Given a compact set $K \subset \Theta_{c_X, c_\varepsilon}$, $B \subset \mathbb{R}^{d+1}$ and $[a, b] \subset \mathbb{R}$, the class of
functions

$$\mathcal{H} := \left\{I_B(\cdot)\, I_{[u, \infty)}(D(\cdot; \theta)) \log(D(\cdot; \theta)) : \theta \in K, u \in [a, b]\right\} \tag{19}$$

is a Glivenko-Cantelli class.

Lemma 8 Let $P$ be an absolutely continuous distribution with strictly positive density
function. Then, for every compact set $K$, we have that

$$\sup_{\theta \in K} |R(\theta, P_n) - R(\theta, P)| \to 0, \quad P\text{-a.e.}$$

In fact, the condition on the existence of a strictly positive density function for $P$
can be removed, but this would imply the use of trimming functions such as those introduced
in Cuesta-Albertos et al. (1997).

Part D: Proof of Proposition 3.2.2


Taking into account Lemma 7, the consistency follows from Corollary 3.2.3 in van der Vaart and Wellner
(1996) exactly as it was done in García-Escudero et al. (2008) and in García-Escudero et al.
(2014b). Note that Lemmas 5 and 6 guarantee the existence of a compact set $K$
such that $\{\theta_n\}_{n=1}^{\infty}$ is included in $K$ with probability 1, and $R(\theta_n, P_n)$ is also included
with probability 1 within an interval $[a, b]$ due to Lemma 8. This has also been
used to simplify the target function needed to apply the aforementioned result in
van der Vaart and Wellner (1996). □